Abstract: We consider the problem of interaction neighborhood estimation from the partial observation of a finite number of realizations of a random field. We introduce a model selection rule to choose estimators of conditional probabilities among natural candidates. Our main result is an oracle inequality satisfied by the resulting estimator. We then use this selection rule in a two-step procedure to evaluate the interacting neighborhoods. The selection rule selects a small prior set of possible interacting points, and a cutting step removes the irrelevant points from this prior set.
We also prove that the Ising models satisfy the assumptions of the main theorems, without restrictions on the temperature, on the structure of the interacting graph or on the range of the interactions. This provides a large class of applications for our results. We give a computationally efficient procedure in these models. We finally show the practical efficiency of our approach in a simulation study.



1. Introduction 

Graphical models, also known as random fields, are used in a variety of domains, including computer vision [4, 21], image processing [9], neuroscience [19], and as a general model in spatial statistics [18]. The main motivation for our work comes from neuroscience, where the advancement of multichannel and optical technology has enabled scientists to study not only a few neurons at a time, but tens to thousands of neurons simultaneously [20]. An important question now in neuroscience is to understand how the neurons in this ensemble interact with each other and how this is related to the animal behavior [8, 19]. This question turns out to be hard for at least three reasons. First, the experimenter only ever has access to a small part of the neural system. Moreover, there is no really good model for populations of neurons, in spite of the good models available for single neurons. Finally, strong long range interactions exist [15]. Our work tries to overcome some of these difficulties, as will be shown.

A random field can be specified by a discrete set of sites $G$, possibly infinite, a finite alphabet of spins $A$, and a probability measure $P$ on the set of configurations $X(G) = A^G$. Among the objects of interest are the one-point specification probabilities, defined for all sites $i$ in $G$ and all configurations $x$ in $X(G)$ by a regular version of the conditional probability

$$P\big(x(i) \mid x(j),\ j \in G \setminus \{i\}\big).$$

From a statistical point of view, two problems are of natural interest. 

Interaction neighborhood identification problem (INI): 

The INI problem is to identify, for all sites $i$ in $G$, the minimal subset $G_i$ of $G$ necessary to describe the specification probabilities in site $i$ (see Sections 2 and 3 for details). $G_i$ is called the interaction neighborhood of $i$ and the points in $G_i$ are said to interact with $i$. $G_i$ is not necessarily finite, but only a finite subset $V_M \subset G$ of sites is observed. The observation set is a sample $X_{1:n}(V_M) = (X_1(j), \ldots, X_n(j))_{j \in V_M}$, where $(X_1, \ldots, X_n)$ are i.i.d. with common law $P$. The question is then to recover from $X_{1:n}(V_M)$, for all $i$ in $V_M$, the sets $G_i \cap V_M$.

Oracle neighborhood problem (ON): 

The ON problem is to identify, for all $i$ in $G$, a set $\hat G_i = \hat G_i(X_{1:n}(V_M))$, such that the estimation of the conditional probabilities $P(x(i) \mid x(j),\ j \in G \setminus \{i\})$ by the empirical conditional probabilities $\hat P(x(i) \mid x(j),\ j \in \hat G_i)$ has a minimal risk (see Sections 2 and 3 for details). $\hat G_i$ is then said to satisfy an oracle inequality and it is also called an oracle. We look for oracles among the subsets of $V_M$ and we consider the $L_\infty$-distance between conditional probabilities to measure the risk of the estimators. An oracle is in general smaller than $G_i$ because it should balance approximation properties and parsimony.

The literature has mainly focused on the INI problem, see [3, 7, 10, 11, 17] for examples. Solving it requires in general strong assumptions on $P$. For example, the $\ell_1$-penalization procedure proposed in [17] requires an incoherence assumption on the interaction neighborhoods that is very restrictive, as shown by [3]. Moreover, it is assumed in [3, 7, 17] that $G$ is finite and that all the sites are observed, i.e. $V_M = G$. Csiszár and Talata [10] consider the case when $G = \mathbb{Z}^d$ but assume a uniform bound on the cardinality of the $G_i$. The procedure proposed in [11] holds for infinite graphs with each site having an infinite neighborhood, but requires that the main interactions belong to a known neighborhood of $i$ of order $O(\ln n)$. Moreover, the result is proved in the Ising model only when the interaction is sufficiently weak.

The first goal of this paper is to show that the ON problem can be solved without any of these hypotheses. We introduce in Section 3.2 a model selection criterion to choose a model $\hat G_i$ and prove that it is an oracle in Theorem 3.2. This result does not require any assumption on the structure of the interaction neighborhoods inside or outside $V_M$.

The second objective is to show that a selection rule also provides a useful tool to handle the INI problem. We introduce the following two-step procedure. First, we select, for all sites $i$ in $V_M$, a small subset $\hat V$ of $V_M$ with the model selection rule. We prove that this set contains the main interacting points inside $V_M$ with large probability. Following the idea introduced in [11], we then use a test to remove from $\hat V$ the points of $(G \setminus G_i) \cap \hat V$. The new test can be applied to all neighborhoods $\hat V$ that are smaller than $O(\ln n)$ and that contain the main interaction points in $G_i$. It requires less restrictive assumptions on the interactions outside $\hat V$ and on the measure $P$ than the one of [11]. For example, it works in the Ising models without restrictions on the temperature parameter. Furthermore, the two-step method lets us look for the interacting points inside the whole observation set $V_M$ (of order $O(e^{n^\beta})$ for some $0 < \beta < 1$), and not only inside a prior subset $V_i$ (smaller than $O(\ln n)$) of $V_M$.

All the results hold under a key assumption H1 that is not classical, but that is satisfied by Ising models, see Theorem 4.5. We thus obtain a large class of models, widely used in practice, where our methods are efficient. We also provide for these models a computationally efficient version of our main algorithms.

The paper is organized as follows. In Section 2, we introduce notations and assumptions used throughout the paper. Section 3 gives the main results, in a general framework. Section 4 shows the application to Ising models and Section 5 presents a large simulation study where the problem of the practical calibration of some parameters is addressed. Section 6 is a discussion of the results with a comparison to existing papers. Section 7 gives the proofs of the main theorems and some technical results are recalled in an appendix in Section 9.

2. Notations and Main Assumptions 

Let $G$ be a discrete set of sites, possibly infinite, $A = \{-1, 1\}$ be the binary alphabet of spins, and $P$ be a probability measure on the set of configurations $X(G) = A^G$. More generally, for all subsets $V$ of $G$, let $X(V) = A^V$ be the set of configurations on $V$. In what follows, the triplet $(G, A, P)$ will be called a random field. For all $i$ in $G$, for all $V \subset G$, for all $x$ in $X(G)$, let $x(V) = (x(j))_{j \in V}$ and for all probability measures $Q$ on $X(V \cup \{i\})$, let

$$Q_{i|V}(x) = Q\big(x(i) \mid x(V \setminus \{i\})\big)$$

be a regular version of the conditional probability. Throughout the paper, we use the convention that, if $V$ is a finite set, $Q$ a probability measure on $X(V)$ and $x$ is a configuration such that $Q(x(V \setminus \{i\})) = 0$, then $Q_{i|V}(x) = 1/2$.
For all $x$ in $X(G)$ and all $j$ in $G$, let $x_j$ be the configuration such that $x_j(k) = x(k)$ for all $k \neq j$ and $x_j(j) = -x(j)$. We say that there is a pairwise interaction from $j$ to $i$ if there exists $x$ in $X(G)$ such that $P_{i|G}(x_j) \neq P_{i|G}(x)$. For all subsets $V$ of $G$, for all probability measures $Q$ on $X(V)$, let

$$\omega^V_{i,j}(Q) = \sup_{x \in X(G)} \big\{ Q_{i|V}(x) - Q_{i|V}(x_j) \big\}.$$

With the above notations, there is a pairwise interaction from $j$ to $i$ if and only if $\omega^G_{i,j}(P) > 0$. Our second task in this paper is to recover the set $G_i$ of sites having a pairwise interaction with $i$. This definition differs in general from the one suggested in the introduction. However, it is easy to check that they coincide in the Ising models defined in Section 4.

Let $X_{1:n} = (X_1, \ldots, X_n)$ be i.i.d. with common law $P$. Let $V_M$ be a finite subset of $G$ of observed sites, with cardinality $M$. The observation set is then $X_{1:n}(V_M) = (X_1(V_M), \ldots, X_n(V_M))$. Let $\hat P$ be the empirical measure on $X(G)$ defined for all configurations $x$ in $X(G)$ by

$$\hat P(x) = \frac{1}{n} \sum_{k=1}^{n} \mathbf{1}_{\{X_k(G) = x(G)\}}.$$

For all real valued functions $f$ defined on $X(G)$, let $\|f\|_\infty = \sup_{x \in X(G)} |f(x)|$.

For all subsets $V$ of $V_M$, the $L_\infty$-risk of $\hat P_{i|V}$ is defined by $\|\hat P_{i|V} - P_{i|G}\|_\infty$. This risk is naturally decomposed into two terms. From the triangle inequality, we have

$$\|\hat P_{i|V} - P_{i|G}\|_\infty \le \|\hat P_{i|V} - P_{i|V}\|_\infty + \|P_{i|V} - P_{i|G}\|_\infty.$$

We call variance term the random term $\|\hat P_{i|V} - P_{i|V}\|_\infty$ and bias term the deterministic one $\|P_{i|V} - P_{i|G}\|_\infty$.
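The empirical conditional probabilities used throughout can be computed by simple counting. The following Python sketch is only an illustration: the data layout (each sample is a dict mapping a site to a spin in {-1, +1}) and the function name `empirical_conditional` are assumptions made here, not part of the paper. It applies the 1/2 convention of this section to contexts that never occur in the sample.

```python
from collections import Counter


def empirical_conditional(samples, i, V):
    """Return a function x -> hat P_{i|V}(x) estimated from the samples.

    samples: list of dicts mapping site -> spin in {-1, +1} (assumed layout)
    i: target site; V: iterable of conditioning sites (i itself is skipped)
    """
    V = tuple(j for j in V if j != i)
    joint = Counter()    # counts of (x(i), x(V))
    context = Counter()  # counts of x(V)
    for s in samples:
        ctx = tuple(s[j] for j in V)
        joint[(s[i], ctx)] += 1
        context[ctx] += 1

    def p_hat(x):
        ctx = tuple(x[j] for j in V)
        if context[ctx] == 0:
            return 0.5  # convention of Section 2 for contexts of zero mass
        return joint[(x[i], ctx)] / context[ctx]

    return p_hat
```

The sup norm of the estimator is then the maximum of `p_hat` over the finitely many configurations of the sites in V and i.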

Let us finally present our general assumptions on the measure $P$. In the following, $\nu$ and $\kappa_{\min}$ are positive constants. The first two assumptions are classical and will only be used to discuss the main results.

NN: (Non-Nullness) For all $x$ in $X(G)$, $\nu^{-1} \le P_{i|G}(x)$.

CA: (Continuity) For all growing sequences $(V_n)_{n \in \mathbb{N}^*}$ of subsets of $G$ such that $\bigcup_{n \in \mathbb{N}^*} V_n = G$, for all $i$ in $G$,

$$\lim_{n \to \infty} \|P_{i|V_n} - P_{i|G}\|_\infty = 0.$$

The following last assumption is very important for the model selection criterion to work. It is satisfied for example by a generalized form of the Ising model, as we will see in Section 4.

H1: For all finite subsets $V$ of $G$,

$$\kappa_{\min} \|P_{i|G} - P_{i|V}\|_\infty \le \|P_{i|G}\|_\infty - \|P_{i|V}\|_\infty.$$
3. General results

3.1. Control of the variance term of the $L_\infty$-risk

Our first theorem provides a sharp control of the variance term of the risk of $\hat P_{i|V}$. It holds without assumption on the measure $P$ or the finite subset $V$.

Theorem 3.1. Let $P$ be a probability measure on $X(G)$, let $V$ be a finite subset of $G$. Let $p^V_- = \inf_{x \in X(G),\, P(x(V)) \neq 0} P(x(V))$. There exists an absolute constant $c_1$ such that, for all $\delta > 1$,

$$\mathbb{P}\left( \|\hat P_{i|V} - P_{i|V}\|_\infty > c_1 \sqrt{\frac{\ln(\delta / p^V_-)}{n p^V_-}} \right) \le \frac{1}{\delta}. \tag{1}$$

Moreover, let $\hat p^V_- = n^{-1} \vee \inf_{x \in X(V)} \hat P(x(V \setminus \{i\}))$. There exists an absolute constant $c_2 \le 400$ such that, for all $\delta > 1$,

$$\mathbb{P}\left( \|\hat P_{i|V} - P_{i|V}\|_\infty > c_2 \sqrt{\frac{\ln(\delta n)}{n \hat p^V_-}} \right) \le \frac{1}{\delta}. \tag{2}$$

Remark:

• Let $|V|$ denote the cardinality of $V$. If $P$ satisfies NN, we have $p^V_- \ge \nu^{-|V|}$. Hence, (1) applied with $\delta = n^2$ implies that

$$\mathbb{P}\left( \|\hat P_{i|V} - P_{i|V}\|_\infty \le c_1 \sqrt{\frac{\nu^{|V|}\big(|V| \ln(\nu) + 2 \ln(n)\big)}{n}} \right) \ge 1 - n^{-2}.$$

The variance term goes almost surely to 0 if $\nu^{|V|} \ll n (\ln n)^{-1}$. If in addition $P$ satisfies CA and $(V_n)_{n \in \mathbb{N}^*}$ is a growing sequence of sets with limit $G$, the estimator $\hat P_{i|V_n}$ is consistent.

• (1) is only interesting theoretically, because the parameter $p^V_-$ is unknown in practice. We will use (2) for our model selection algorithm.
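The data-driven quantity in bound (2) is cheap to evaluate. The sketch below is a hedged illustration: it assumes that the empirical complexity term is the maximum of 1/n and the smallest empirical frequency of a configuration of the conditioning sites, and it reuses an assumed data layout (a list of dicts mapping sites to spins). The names `p_hat_minus` and `variance_bound` are hypothetical, and the default `c2=400.0` is only the pessimistic upper bound quoted in the theorem.

```python
import math
from collections import Counter
from itertools import product


def p_hat_minus(samples, i, V):
    """max(1/n, smallest empirical frequency of a configuration of V without i)."""
    V = tuple(j for j in V if j != i)
    n = len(samples)
    counts = Counter(tuple(s[j] for j in V) for s in samples)
    # Configurations never observed contribute a frequency of 0.
    smallest = min(counts.get(ctx, 0) for ctx in product((-1, 1), repeat=len(V)))
    return max(1.0 / n, smallest / n)


def variance_bound(samples, i, V, delta, c2=400.0):
    """Right-hand threshold of bound (2): c2 * sqrt(ln(delta * n) / (n * p_hat_minus))."""
    n = len(samples)
    return c2 * math.sqrt(math.log(delta * n) / (n * p_hat_minus(samples, i, V)))
```

Because the threshold grows like the inverse square root of the rarest context frequency, adding a site to V can at best halve that frequency, which is why large neighborhoods are penalized quickly.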



3.2. Model Selection 

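We deduce from Theorem 3.1 that the risk of the estimator $\hat P_{i|V}$ is bounded in the following way. For all $\delta > 1$, for all subsets $V$,

$$\mathbb{P}\left( \|P_{i|G} - \hat P_{i|V}\|_\infty \le \|P_{i|G} - P_{i|V}\|_\infty + c_2 \sqrt{\frac{\ln(\delta n)}{n \hat p^V_-}} \right) \ge 1 - \delta^{-1}. \tag{3}$$

The risk of $\hat P_{i|V}$ depends on the approximation properties of $V$ through the bias $\|P_{i|G} - P_{i|V}\|_\infty$, which is typically unknown in practice, and on the complexity of $V$, measured here by $\hat p^V_-$. The aim of this section is to provide model selection procedures in order to select a subset of $V_M$ that optimizes the bound (3). In the following, we denote by $\mathcal{G}_n$ a finite collection of subsets of $V_M$, possibly random, and we call optimal or oracle in $\mathcal{G}_n$ any subset $\hat G = \hat G\big(X_{1:n}(\bigcup_{V \in \mathcal{G}_n} V)\big)$ in $\mathcal{G}_n$, possibly random, such that

$$\mathbb{P}\left( \|P_{i|G} - \hat P_{i|\hat G}\|_\infty \le K \inf_{V \in \mathcal{G}_n} \left\{ \|P_{i|G} - P_{i|V}\|_\infty + \sqrt{\frac{\ln(\delta n)}{n \hat p^V_-}} \right\} \right) \ge 1 - \delta^{-1}.$$

We introduce the following selection rule. Let $N_n$ be an almost sure bound on the cardinality of $\mathcal{G}_n$. For all $\delta > 1$ and for all $C > c_2$, let

$$\hat G(C, \delta, \mathcal{G}_n) = \arg\min_{V \in \mathcal{G}_n} \left\{ -\|\hat P_{i|V}\|_\infty + C\, \mathrm{pen}(V) \right\}, \quad \text{where } \mathrm{pen}(V) \ge \sqrt{\frac{\ln(\delta n N_n)}{n \hat p^V_-}}. \tag{4}$$

The following theorem states that $\hat G(C, \delta, \mathcal{G}_n)$ is almost an oracle.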

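Theorem 3.2. Let $P$ be a probability measure on $X(G)$ satisfying H1. Let $\mathcal{G}_n$ be a finite collection of finite subsets of $G$, possibly random, and let $N_n$ be an almost sure bound on the cardinality of $\mathcal{G}_n$. For all $C > c_2$, $\delta > 1$, let $\hat G_\delta(C) = \hat G(C, \delta, \mathcal{G}_n)$ be the estimator given by (4). There exists a positive constant $K = K(c_2, C, \kappa_{\min})$ such that

$$\mathbb{P}\left( \|\hat P_{i|\hat G_\delta(C)} - P_{i|G}\|_\infty \le K \inf_{V \in \mathcal{G}_n} \left\{ \|P_{i|G} - P_{i|V}\|_\infty + \mathrm{pen}(V) \right\} \right) \ge 1 - \frac{1}{\delta}.$$

Remarks: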



• Theorem 3.2 states that the risk of the estimator selected by the rule (4) is, up to the constant $K$, the best among the collection $\mathcal{G}_n$. It is the main result of the paper and we will discuss several applications in what follows.

• The key idea of the proof is that, by assumption H1, we have $\|P_{i|G}\|_\infty - \|P_{i|V}\|_\infty \asymp \|P_{i|G} - P_{i|V}\|_\infty$. Hence, our decision rule consists essentially in minimizing the sum of the bias term and the variance term of the risk, and the selected estimator is then an oracle.

• The constant $c_2$ derived in Theorem 3.1 is very pessimistic. Hence, Theorem 3.2 is mainly of theoretical interest. In the simulations of Section 5, we will calibrate $C$ with the slope algorithm introduced in [5] and illustrate the good properties of the resulting $\hat G_\delta(C)$.
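A brute-force version of the selection rule (4) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the collection is taken to be all subsets of a candidate list up to a maximal size, the penalty is taken equal to its lower bound in (4), and the data layout (list of dicts mapping sites to spins) and the name `select_neighborhood` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations, product

SPINS = (-1, 1)


def select_neighborhood(samples, i, candidates, C, delta, max_size):
    """Minimize -||hat P_{i|V}||_inf + C * pen(V) over subsets V of candidates."""
    n = len(samples)
    collection = [V for k in range(max_size + 1)
                  for V in combinations(candidates, k)]
    N_n = len(collection)  # cardinality of the collection
    best, best_score = None, math.inf
    for V in collection:
        ctx_counts = Counter(tuple(s[j] for j in V) for s in samples)
        joint = Counter((s[i], tuple(s[j] for j in V)) for s in samples)
        contexts = list(product(SPINS, repeat=len(V)))
        # Sup norm of the empirical conditional probability (1/2 if context unseen).
        sup_norm = max(
            joint[(a, ctx)] / ctx_counts[ctx] if ctx_counts[ctx] else 0.5
            for a in SPINS for ctx in contexts
        )
        p_minus = max(1.0 / n, min(ctx_counts[ctx] for ctx in contexts) / n)
        pen = math.sqrt(math.log(delta * n * N_n) / (n * p_minus))
        score = -sup_norm + C * pen
        if score < best_score:
            best, best_score = V, score
    return set(best)
```

With a small constant the rule keeps informative sites; with a very large constant the penalty dominates and the empty model is returned, which is one way to see why the theoretical constant is too pessimistic for practical use.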

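Let us go back to the ON problem. It is solved thanks to the following corollary.

Corollary 3.3. Let $(G, A, P)$ be a random field. Let $V_M$ be a finite subset of $G$ with cardinality $M$, let $\delta > 1$ and let $\Gamma_M(\delta) = \ln(n)(1 + \log_2(M)) + \ln(\delta)$. For all $m$, $e \le m \le M$, let $\mathcal{G}_{m,M} = \{V \subset V_M,\ |V| \le m\}$. For all $V \subset V_M$, let $\mathrm{pen}(V) = (n \hat p^V_-)^{-1/2} \sqrt{\Gamma_M(\delta)}$, and let $\hat G_\delta(C) = \hat G(C, \delta, \mathcal{G}_{\log_2(n),M})$ be the estimator given by (4). With probability larger than $1 - \delta^{-1}$, we have

$$\|\hat P_{i|\hat G_\delta(C)} - P_{i|G}\|_\infty \le K \inf_{V \subset V_M,\ |V| \le \log_2(n)} \left\{ \|P_{i|G} - P_{i|V}\|_\infty + \mathrm{pen}(V) \right\}.$$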
Remarks:

• The complexity of the model selection algorithm for the collection $\mathcal{G}_{m,M}$ is $O(nM^m)$. This collection is used when a uniform bound $m$ on the cardinalities $|G_i|$ is known. This complexity is the minimal one necessary to recover the interaction graph in this problem [7].
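The form of the penalty in Corollary 3.3 can be traced back to a counting argument. The derivation below is a sketch that assumes the standard binomial bound $\sum_{k \le m} \binom{M}{k} \le (eM/m)^m$, which is not stated in the text; it indicates why $\Gamma_M(\delta)$ dominates the term $\ln(\delta n N_n)$ required by (4) when $m = \log_2(n) \ge e$.

```latex
% Number of subsets of V_M with at most m elements, for e <= m <= M:
N_n \;=\; \sum_{k=0}^{m} \binom{M}{k}
    \;\le\; \Big(\frac{eM}{m}\Big)^{m}
    \;\le\; M^{m}.
% With m = \log_2(n):
\ln(\delta\, n\, N_n)
    \;\le\; \ln(\delta) + \ln(n) + \log_2(n)\,\ln(M)
    \;=\; \ln(\delta) + \ln(n)\big(1 + \log_2(M)\big)
    \;=\; \Gamma_M(\delta),
% using \log_2(n)\ln(M) = \ln(n)\ln(M)/\ln(2) = \ln(n)\log_2(M).
```

The same bound $N_n \le M^m$ is consistent with the $O(nM^m)$ cost quoted in the remark, since each model requires one pass over the $n$ observations.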



3.3. Estimation of the interaction subgraph 

Let $M$ be an integer and let $V_M$ be a finite subset of $G$, with cardinality $M$. For all subsets $V$ of $V_M$, let us choose $v_V(\delta) \ge \sqrt{\ln(\delta n)}\,(n \hat p^V_-)^{-1/2}$. Let $V$ be a finite subset of $G$; we study in this section the estimators of $G_i$ given by

$$\hat G^V_i(c) = \left\{ j \in V,\ \omega^V_{i,j}(\hat P) > c\, v_V(\delta) \right\}. \tag{5}$$

We introduce the following function. 

V, p_ >v z 

$\Psi$ represents the minimal value of the bias term at a given value of the variance term. Our assumption concerns the rate of convergence of $\Psi$ to 0.

H2($\epsilon_\Psi$): There exist $C_\Psi > 0$, $\alpha_\Psi > 0$ such that, for all $K > 1$, for all $v > 0$,
$$\mathbb{P}\left( \Psi(Kv) \le C_\Psi K^{-\alpha_\Psi} \Psi(v) \right) \ge 1 - \epsilon_\Psi.$$

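Theorem 3.4. Let $(G, A, P)$ be a random field satisfying H1, H2. Let $e \le M$ be an integer, let $V_M$ be a finite subset of $G$ with cardinality $M$. Let $\delta > 1$ and let $\Gamma_M(\delta) = \ln(n)(1 + \log_2(M)) + \ln(\delta)$. Let $\mathcal{G}_n = \{V \subset V_M,\ |V| \le \log_2(n)\}$. For all $V$ in $\mathcal{G}_n$, let

$$v_V(\delta) = \sqrt{\frac{\Gamma_M(\delta)}{n \hat p^V_-}}.$$

Let $C > c_2$, $\mathrm{pen}(V) = v_V(\delta)$ and let $\hat G_\delta(C) = \hat G(C, \delta, \mathcal{G}_n)$ be the set selected by the selection rule (4). Let $c > 0$ and let $\hat G_i^{\hat G_\delta(C)}(c)$ be the associated set defined by (5). Let $K$ be the constant defined in Theorem 3.2 and let

$$C_\infty = 2\left( c_2 + C_\Psi^{-1/\alpha_\Psi} (2K)^{1 - 1/\alpha_\Psi} \right).$$

We have

$$\mathbb{P}\Big( \big\{ j \in V_M,\ \omega^G_{i,j}(P) > (c + C_\infty)\, v_{\hat G_\delta(C)}(\delta) \big\} \subset \hat G_i^{\hat G_\delta(C)}(c) \subset \big\{ j \in V_M,\ \omega^G_{i,j}(P) > (c - C_\infty)\, v_{\hat G_\delta(C)}(\delta) \big\} \Big) \ge 1 - \delta^{-1} - \epsilon_\Psi.$$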
Remark: 

• When $c > C_\infty$, $\hat G_i^{\hat G_\delta(C)}(c)$ contains exactly the sites that have a pairwise interaction with $i$ of the order of the risk of an oracle. It provides a partial solution to the INI problem.

• Theorem 3.4 requires the extra assumption H2 compared to Theorem 3.2. Moreover, the theoretical constant $C_\infty$ depends on the constants $\kappa_{\min}$, $C_\Psi$, $\alpha_\Psi$.

Let us conclude this section with the two-step algorithm suggested by Theorem 3.4 to estimate $G_i = \{ j \in G,\ \omega^G_{i,j}(P) > 0 \}$.

Estimation algorithm: 

• Choose a large subgraph $V_M$ of $G$, typically the $M$ nearest neighbors of $i$ in $G$.

• Selection step. Choose a model $\hat G$, applying the model selection algorithm of Theorem 3.2 to the collection of all subgraphs of $V_M$ with cardinality smaller than $\log_2(n)$.

• Cutting step. Keep only the sites $j$ of $\hat G$ such that $\omega^{\hat G}_{i,j}(\hat P) > c\, v_{\hat G}(\delta)$, as in (5), and cut the other edges.
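The cutting step can be sketched as follows. This is a minimal illustration under assumptions: the interaction strength of a site j is estimated by the largest change of the empirical conditional probability when only the spin at j is flipped, the threshold is passed in directly as a number, and the data layout and the name `cut` are hypothetical.

```python
from collections import Counter
from itertools import product

SPINS = (-1, 1)


def cut(samples, i, V, threshold):
    """Keep the sites j of V whose estimated interaction with i exceeds threshold."""
    V = tuple(V)
    ctx_counts = Counter(tuple(s[j] for j in V) for s in samples)
    joint = Counter((s[i], tuple(s[j] for j in V)) for s in samples)

    def p_hat(a, ctx):
        # Empirical conditional probability, with the 1/2 convention.
        return joint[(a, ctx)] / ctx_counts[ctx] if ctx_counts[ctx] else 0.5

    kept = set()
    for pos, j in enumerate(V):
        omega = 0.0
        for a in SPINS:
            for ctx in product(SPINS, repeat=len(V)):
                # Same context with the spin at site j flipped.
                flipped = ctx[:pos] + (-ctx[pos],) + ctx[pos + 1:]
                omega = max(omega, p_hat(a, ctx) - p_hat(a, flipped))
        if omega > threshold:
            kept.add(j)
    return kept
```

In the two-step procedure, `V` would be the model returned by the selection step and `threshold` the product of the constant c and the variance proxy of that model.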

4. Ising Models 

The remainder of the paper is devoted to Ising models. These models are very important in statistical mechanics [12] and neuroscience [19], where they represent the interactions between particles and between neurons, respectively. In this section, we prove that Ising models satisfy H1, so that all our general results apply in these models. We also define effective algorithms for the ON and INI problems, adapted to this special case.

4.1. Verification of H1

Let us recall the definition of Ising models. 

Definition 4.1. Let $f : G^2 \times A^2 \to \mathbb{R}$, $(i, j, a, b) \mapsto f_{i,j}(a, b)$ be a real valued function. For all $i, j$ in $G$ and all $a$ in $A$, let $\|f^a_{i,j}\| = \max_{b \in A} |f_{i,j}(a, b)|$. $f$ is said to be a pairwise potential of interaction if for all $a, b$ in $A$, $f_{i,i}(a, b) = 0$ and if

$$\Gamma := \sup_{i \in G} \sup_{a \in A} \sum_{j \in G} \|f^a_{i,j}\| < \infty.$$

In this case, T — is called the temperature parameter of the pairwise poten- 
tial f . 

Definition 4.2. A probability measure $P$ on $X(G)$ is called an Ising model with potential $f$ if, for all $x \in X(G)$,

The existence of such a measure is well known [12].
Remark: 
• The classical Ising model has potential $f$ defined by $f_{i,j}(a, b) = J_{ij}\, a b + H_i\, a\, \mathbf{1}_{\{i = j\}}$, $J_{ij} \in \mathbb{R}$, $H_i \in \mathbb{R}$, for all $a, b \in A$ and $i, j \in G$.

• One of the fundamental questions studied for this class of models is the description of conditions on the potential $f$ that guarantee uniqueness or non-uniqueness of the Ising model. Usually, high temperature implies uniqueness of the Ising model and low temperature implies non-uniqueness [12].

Let $g_{i,j}(a, b) = f_{i,j}(a, b) - f_{i,j}(-a, b)$, we have then

1 

It is clear that Ising models satisfy CA and NN with $\nu^{-1} = (1 + e^{2\Gamma})^{-1}$.
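The non-nullness constant can be checked by a short computation. The lines below are a sketch under an explicit assumption: the one-point specification is taken in the usual Gibbs form, proportional to $\exp\big(\sum_{j} f_{i,j}(x(i), x(j))\big)$; the exact normalization and sign convention of the definition above may differ.

```latex
% Assumed Gibbs form of the one-point specification (two-letter alphabet):
P_{i|G}(x)
  = \frac{\exp\big(\sum_{j \in G} f_{i,j}(x(i),x(j))\big)}
         {\sum_{a \in A} \exp\big(\sum_{j \in G} f_{i,j}(a,x(j))\big)}
  = \frac{1}{1 + \exp\big(-\sum_{j \in G} g_{i,j}(x(i),x(j))\big)} .
% Since |g_{i,j}(a,b)| \le \|f^{a}_{i,j}\| + \|f^{-a}_{i,j}\|, summing over j gives
\Big|\sum_{j \in G} g_{i,j}(x(i),x(j))\Big| \le 2\Gamma
\quad\Longrightarrow\quad
P_{i|G}(x) \ge \frac{1}{1 + e^{2\Gamma}} .
```

Under this assumed form the lower bound holds for every value of $\Gamma$, which is consistent with the claim that no restriction on the temperature is needed.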

Definition 4.3. Let $(G, A, P)$ be an Ising model, with potential $f$. For all $i, j$ in $G$, for all $a$ in $A$, let

$$\omega_{i,j}(f) = \sup_{(a,b) \in A^2} \{ g_{i,j}(a, b) - g_{i,j}(a, -b) \} = \sup_{b \in A} \{ g_{i,j}(a, b) - g_{i,j}(a, -b) \}.$$

Let us first recall some elementary facts about Ising models. 

Proposition 4.4. Let $(G, A, P)$ be an Ising model, with potential $f$. For all finite subsets $V$ of $G$, for all $i, j$ in $G$, we have

1. $p^V_- \ge (1 + e^{2\Gamma})^{-|V|}$.

2. < w &.(p) < /; ( ;y;_- i ) u , J (/). 

The following theorem states that all of our general results apply in Ising models. The key ingredient of the proof is the precise control of the bias term (6).

Theorem 4.5. Let $(G, A, P)$ be an Ising model, with potential $f$. There exist two positive constants $c_* \le C^*$ such that, for all subsets $V$ of $G$,

$$c_* \sum_{j \in G \setminus V} \omega_{i,j}(f) \le \|P_{i|G} - P_{i|V}\|_\infty \le C^* \sum_{j \in G \setminus V} \omega_{i,j}(f). \tag{6}$$

$P$ satisfies assumption H1, i.e. there exists a constant $\kappa_{\min} > 0$ such that, for all finite subsets $V$ of $G$,

$$\kappa_{\min} \|P_{i|G} - P_{i|V}\|_\infty \le \|P_{i|G}\|_\infty - \|P_{i|V}\|_\infty.$$

4.2. A special strategy for Ising models

The model selection algorithm (4) might be computationally demanding in practice when the collection $\mathcal{G}_n$ is too large. This is the case for the collection $\mathcal{G}_{\log_2(n),M}$ used several times in Section 3, when the values of $M$ and $n$ are large. The purpose of this section is to show that a special strategy, computationally more attractive, can be adopted in Ising models. The idea comes from [7]. Let us describe the method.

Reduction of the number of sites. Let $x_1$ be the configuration in $X(G)$ such that, for all $j$ in $G$, $x_1(j) = 1$.

Step 1 Computation of the empirical probabilities. For all $j$ in $V_M$, let

$$\hat p(j) = \hat P(x_1(j)), \qquad \hat p(i, j) = \hat P(x_1(i, j)).$$

Step 2 Reduction step. We keep the $j$ in $V_M$ such that

$$|\hat p(i, j) - \hat p(i)\hat p(j)| > \eta.$$

Let also $\eta_{ms}$ be the smallest $\eta \ge 3\sqrt{(2n)^{-1} \ln(6M\delta)}$ such that the number of $j$ kept after Step 2 is smaller than $\kappa \log_2(n)$.

We denote by $\hat V(\eta)$ the set of $j$ kept after Step 2. It is clear that the reduction algorithm has a complexity $O(nM)$. Note that the values $|\hat p(i, j) - \hat p(i)\hat p(j)|$ do not depend on the choice of the configuration $x_1$, since the alphabet has only two letters.
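The reduction step amounts to a covariance screening and can be sketched as follows. The function name `screen_sites` and the data layout (a list of dicts mapping sites to spins in {-1, +1}) are assumptions made for illustration; the threshold `eta` is supplied by the caller.

```python
def screen_sites(samples, i, sites, eta):
    """Keep the sites j with |p_hat(i, j) - p_hat(i) * p_hat(j)| > eta.

    p_hat(j) is the empirical frequency of spin +1 at site j and
    p_hat(i, j) the empirical frequency of spins (+1, +1) at (i, j).
    """
    n = len(samples)
    p_i = sum(1 for s in samples if s[i] == 1) / n
    kept = []
    for j in sites:
        if j == i:
            continue
        p_j = sum(1 for s in samples if s[j] == 1) / n
        p_ij = sum(1 for s in samples if s[i] == 1 and s[j] == 1) / n
        if abs(p_ij - p_i * p_j) > eta:
            kept.append(j)
    return kept
```

Each site requires one pass over the n observations, which matches the O(nM) cost stated for the reduction algorithm.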

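Model selection algorithm. Let $\mathcal{G} = \{ V \subset \hat V(\eta_{ms}) \}$.

Step 1 Computation of the conditional probabilities. For all $V \subset \hat V(\eta_{ms})$, compute $\|\hat P_{i|V}\|_\infty$ and $\mathrm{pen}(V)$.
Step 2 Selection step. We choose $C > c_2$ and

$$\hat G = \arg\min_{V \in \mathcal{G}} \left\{ -\|\hat P_{i|V}\|_\infty + C\, \mathrm{pen}(V) \right\}.$$

It is clear that $\hat m = |\hat V(\eta_{ms})| \le \kappa \log_2(n)$, hence

$$N = |\mathcal{G}| = \sum_{k=0}^{\hat m} \binom{\hat m}{k} \le 2^{\hat m} \le n^{\kappa}.$$

Hence, the complexity of the model selection algorithm is $O(n^{\kappa + 1})$. The global complexity of the algorithm is therefore $O(n^{\kappa + 1} + nM)$. As a comparison, the complexity of the model selection algorithm for $\mathcal{G}_n = \mathcal{G}_{\log_2(n),M}$ was $O(nM + n^{\log_2(M)})$.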

4.2.1. Control of the risk of the resulting estimator

Theorem 4.6. Let (G,A,P) be an Ising model, with potential f. Let 

_ 4r(l + e 2r ) 3 _ 4r(l + e 2r ) 2 
1 ~ e^ r (e 4r - 1)' 2 ~ e 6r {e 4r - 1)' 
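With probability larger than $1 - \delta^{-1}$, we have that

$$\left\{ j \in V_M,\ |\omega_{i,j}(f)| > C_1 \left( \eta + 3\sqrt{\frac{\ln(6M\delta)}{2n}} \right) \right\} \subset \hat V(\eta) \subset \left\{ j \in V_M,\ |\omega_{i,j}(f)| > C_2 \left( \eta - 3\sqrt{\frac{\ln(6M\delta)}{2n}} \right) \right\}.$$

Furthermore, let us denote by

$$V(\delta, M) = \left\{ j \in V_M;\ |\omega_{i,j}(f)| \le C_1 \left( \eta_{ms} + 3\sqrt{(2n)^{-1} \ln(6M\delta)} \right) \right\}.$$

With probability larger than $1 - 2\delta^{-1}$, we have,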
1 



A" 



j6V(«,A0 U'ev(»7)/v 



Remarks: 



• The estimator of the interaction graph has better properties than the one obtained with the selection and cutting procedure. The main difference is that there is no term $(\hat p^V_-)^{-1/2}$ in the rate of convergence.

• The oracle inequality might be slightly less sharp than the one obtained in (19). This is the price to pay for a computationally efficient algorithm.

• Our result holds in the Ising model. However, [7] used a similar approach in more general random fields with some additional assumptions and obtained good properties for the INI problem.

5. Simulation studies 

In this section we illustrate the results obtained in Sections 3 and 4 using simulation experiments and introduce the slope heuristic. All these simulation experiments can be reproduced by a set of MATLAB® routines that can be downloaded from www.princeton.edu/~dtakahas/publications/LTlOroutines.zip.
Let $G = \{-1, 0, 1\} \times \{-1, 0, 1\}$. For Sections 5.1 to 5.7, we consider an Ising model on $A^G$, with pairwise potential given by $f_{ij}(c, d) = J\, \mathbf{1}_{j \in V_i}\, c d$ for $i, j \in G$, $c, d \in A$, $J = 0.2$, and $V_i \subset G$. The pairs of sites $(i, j)$ with $j \in V_i$ are shown in Figure 1. For all these experiments, $i = (0, 0)$. We simulated independent samples of the Ising model with increasing sample sizes $n = 100k$, $k = 1, \ldots, 100$. For each sample size we have $N = 100$ independent replicas.
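On a 3 x 3 grid the model has only 512 configurations, so exact sampling by enumeration is feasible. The Python sketch below is not the authors' MATLAB code: the function name `ising_sampler` is hypothetical, the edge list must be supplied by the user (the actual neighborhoods are those of Figure 1 and are not reproduced here), and counting each interacting pair once in the exponent is an assumption about the normalization.

```python
import itertools
import math
import random


def ising_sampler(sites, edges, J, rng):
    """Exact sampler for a small Ising model with weights exp(J * sum x(a) x(b)).

    sites: list of site labels; edges: list of interacting pairs (a, b);
    rng: a random.Random instance. Only practical for a handful of sites.
    """
    index = {s: k for k, s in enumerate(sites)}
    configs = list(itertools.product((-1, 1), repeat=len(sites)))
    weights = [
        math.exp(J * sum(c[index[a]] * c[index[b]] for a, b in edges))
        for c in configs
    ]

    def draw(n):
        # Independent draws from the normalized weights.
        return [dict(zip(sites, c))
                for c in rng.choices(configs, weights=weights, k=n)]

    return draw
```

For instance, `ising_sampler(sites, edges, 0.2, random.Random(0))(1000)` returns 1000 independent configurations as dicts, the layout assumed by the other sketches in this document.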

5.1. Variance term of the risk 

In the following experiment we check Theorem 3.1 in a simulation. For each sample size we computed the normalized variance term $\sqrt{n}\, \|\hat P_{i|V_i} - P_{i|V_i}\|_\infty$ for $N$ different samples and obtained the average value. The result is summarized in Figure 2.

Fig 1. Representation of the interacting pairs of the Ising model used in the simulation experiments. The edges between sites indicate the interacting pairs. The grey colored edges indicate the sites interacting with site (0, 0).

Fig 2. Plot of the number of samples $n$ (sample size) against $\sqrt{n}\, \|\hat P_{i|V_i} - P_{i|V_i}\|_\infty$. The dotted line indicates the linear regression line. Observe that the regression line is essentially parallel to the abscissa.
5.2. Slope heuristic

The constant $c_2$ derived from Theorem 3.1 is too pessimistic to be used in practice. The purpose of this section is to present a general method to calibrate this constant. It is based on the slope heuristic, introduced in [5] and proved in several other frameworks in [1, 14]. We refer also to [2] for a detailed discussion on the practical use of this method. In order to describe it, let us introduce, for all $V$ in $\mathcal{G}_{m,M}$, a quantity $\Delta_V$, possibly random, measuring the complexity of the model $V$. The heuristic states the following facts.

1. There exists a positive constant $C_{\min}$ such that when $C < C_{\min}$, the complexity of the model selected by the rule (4) is as large as possible.

2. When $C$ is slightly larger than $C_{\min}$, the complexity of the selected model is much smaller.

3. When $C = 2C_{\min}$, the risk of the selected model is asymptotically the one of an oracle.

The heuristic yields the following algorithm, defined for all complexity measures $\Delta_V$.

1. For all $C > 0$, compute $\Delta_{\hat G(C)}$, the complexity of the model selected by the rule (4).

2. Choose $\hat C_{\min}$ such that $\Delta_{\hat G(C)}$ is very large for $C < \hat C_{\min}$ and much smaller for $C > \hat C_{\min}$.

3. Select the final $\hat G = \hat G(2\hat C_{\min})$.

The algorithm is based on the idea that $\hat C_{\min} \approx C_{\min}$ and therefore that the final $\hat G$, selected with the penalty $2\hat C_{\min} \Delta_V$, is an oracle by the third point of the slope heuristic. The actual efficiency of this approach depends highly on the choice of the complexity measure $\Delta_V$ and on the practical way to choose $\hat C_{\min}$ in step 2 of the algorithm. We illustrate the dependence on $\Delta_V$ in the following experiments.

$\Delta_V$ is either the cardinality of $V$ (the dimension) or the variance estimator $(n \hat p^V_-)^{-1/2}$. $\hat C_{\min}$ is selected with the maximum jump criterion [2]: fix an increasing sequence of positive numbers $C_0, \ldots, C_t$ and define

$$\hat k = \arg\max_{k} \left\{ \Delta_{\hat G(C_k)} - \Delta_{\hat G(C_{k+1})} \right\}, \quad \text{and} \quad \hat C_{\min} = C_{\hat k}.$$

If the maximum is achieved at more than one value, take the largest such $k$.

Remark: The calculation of $\hat C_{\min}$ does not yield a significant increase of computational time compared to the evaluation of the model selection criterion for one fixed constant $C$. The only additional cost is due to the fact that one has to keep in the computer memory the conditional probabilities, which need to be computed only once.
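The maximum jump criterion is easy to express in code. The sketch below assumes that the complexities of the selected models have already been computed on a grid of constants; the function name `max_jump_cmin` and the indexing convention (the jump between consecutive grid points k and k+1 is attributed to index k) are assumptions made for illustration.

```python
def max_jump_cmin(constants, complexities):
    """Return the grid constant located at the largest drop in complexity.

    constants: increasing grid C_0 < ... < C_t
    complexities: complexity of the model selected with each constant
    Ties are broken in favour of the largest index.
    """
    if len(constants) != len(complexities) or len(constants) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    best_k, best_jump = 0, float("-inf")
    for k in range(len(constants) - 1):
        jump = complexities[k] - complexities[k + 1]
        if jump >= best_jump:  # ">=" keeps the largest k among ties
            best_k, best_jump = k, jump
    return constants[best_k]
```

The final model is then selected by rerunning the selection rule with twice the returned constant.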
5.3. Oracle risk compared to the risk of the estimated model

One way to verify the performance of the slope heuristic proposed in the previous section is to compute the ratio

$$\frac{\|\hat P_{i|\hat G(2\hat C_{\min})} - P_{i|G}\|_\infty}{\inf_{V \subset G} \|\hat P_{i|V} - P_{i|G}\|_\infty}. \tag{7}$$

With a reasonable procedure, we expect that the above quantity remains bounded. We applied the model selection procedure (4) with the slope heuristic discussed above to the set $\{V \subset G \setminus \{i\} : |V| \le 8\}$. For each sample size we computed the ratio (7) for 100 different samples and we obtained the average. The result is summarized in Figure 3.

Fig 3. Plot of the number of samples $n$ (sample size) against the average of ratio (7). Observe that the risk ratio remains bounded for both the variance (solid black) and the dimension (dashed grey) as the measure of complexity.



5.4. Discovery rate of the model selection procedure for the ON problem

Another way to measure the performance of our model selection procedure is to compute the positive discovery rate

$$\mathbb{E}\left[ \frac{|\hat G(2\hat C_{\min}) \cap G_{\mathrm{oracle}}|}{|G_{\mathrm{oracle}}|} \right] \tag{8}$$

M. Lerasle and D. Y. Takahashi/ Interaction Neighborhood Estimation 






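and the negative discovery rate

$$\mathbb{E}\left[ \frac{|G \setminus (\hat G(2\hat C_{\min}) \cup G_{\mathrm{oracle}})|}{|G \setminus G_{\mathrm{oracle}}|} \right] \tag{9}$$

with respect to the oracle $G_{\mathrm{oracle}}$.

We estimated (8) and (9) and the result is summarized in Figure 4.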
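For a single replica, the two rates reduce to set operations. The following sketch uses hypothetical names (`discovery_rates`, `selected`, `reference`, `universe`); averaging its output over replicas gives a Monte Carlo estimate of the expectations. It assumes that the reference set is neither empty nor the whole set of sites, since the ratios are undefined otherwise.

```python
def discovery_rates(selected, reference, universe):
    """Positive and negative discovery rates of `selected` w.r.t. `reference`.

    positive: share of the reference sites that were selected
    negative: share of the non-reference sites that were not selected
    """
    selected, reference, universe = set(selected), set(reference), set(universe)
    positive = len(selected & reference) / len(reference)
    negative = len(universe - (selected | reference)) / len(universe - reference)
    return positive, negative
```

The same function applies when the reference set is the true interaction neighborhood instead of the oracle.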




Fig 4. Plot of positive and negative discovery rates with respect to the oracle against the sample size $n$. In solid/dashed black lines are represented the positive/negative discovery rates using the variance (V) as the complexity measure and in solid black/grey lines the positive/negative discovery rates using the dimension (D). Observe that the variance gives better positive and negative discovery rates with respect to the oracle when compared to the dimension.



5.5. Performance of the model selection procedure for the INI problem

A natural question is how well the proposed model selection procedure behaves for the INI problem. Observe that the model selection procedure was designed to solve the ON problem and in principle does not necessarily work for the INI problem. To investigate this question, for each sample size we estimated the positive discovery rate

$$\mathbb{E}\left[ \frac{|\hat G(2\hat C_{\min}) \cap V_i|}{|V_i|} \right]$$

and the negative discovery rate

$$\mathbb{E}\left[ \frac{|G \setminus (\hat G(2\hat C_{\min}) \cup V_i)|}{|G \setminus V_i|} \right]$$

with respect to the interaction neighborhood $V_i$. The result is summarized in Figure 5.




Fig 5. Plot of positive and negative discovery rates with respect to Vi against the sample 
size n. In solid/dashed black lines are represented the positive/negative discovery rates 
using the variance (V) as the complexity measure and in solid black/grey lines the 
positive/negative discovery rates using the dimension (D). Observe that the variance 
gives higher positive discovery rates than the dimension as the measure of complexity 
although the negative discovery rates are the same. 



5.6. Relationship between the INI and ON problems 



Another interesting question is to understand the relationship between the INI and ON problems. Useful quantities for this are the positive discovery rate

$$\mathbb{E}\left[ \frac{|G_{\mathrm{oracle}} \cap V_i|}{|V_i|} \right] \tag{10}$$

and the negative discovery rate

$$\mathbb{E}\left[ \frac{|G \setminus (G_{\mathrm{oracle}} \cup V_i)|}{|G \setminus V_i|} \right]. \tag{11}$$

We estimated these quantities and the results are summarized in Figure 6.
Fig 6. Plot of positive and negative discovery rates of the oracle with respect to $V_i$ against the sample size $n$. The solid black line represents the results for positive discovery rates and the dashed grey line represents the results for the negative discovery rates. Observe that in this example the oracle $G_{\mathrm{oracle}}$ matches the interaction neighborhood $V_i$ quite fast. Also observe that in this example the oracle never included interactions not contained in $V_i$.



5.7. Select and cut procedure 

Here we show the usefulness of the two-step procedure introduced in Theorem 3.4 with an example. We consider the same independent samples used in the previous experiments. We also consider $i = (0, 0)$ and sample sizes $n = 100k$, $k = 1, \ldots, 100$, with 100 independent replicas for each sample size.

Let $\hat G(2\hat C_{\min})$ be the subset of $G$ chosen by first applying the model selection procedure to the set $\{V \subset G \setminus \{i\} : |V| \le 8\}$. To choose the constant in the model selection procedure, we used the slope heuristic with variance as the complexity measure. Let $\hat G(SC)$ be the subset of $G$ obtained by applying to the subset $\hat G(2\hat C_{\min})$ the cutting procedure with $c\, v_{\hat G} = 0.3\, (n \hat p^{\hat G}_-)^{-1}$. We first computed the average of the risk ratio

$$\frac{\|\hat P_{i|\hat G(SC)} - P_{i|G}\|_\infty}{\inf_{V \subset G} \|\hat P_{i|V} - P_{i|G}\|_\infty} \tag{12}$$

for each sample size and compared it with the average of risk ratio (7). The results are summarized in Figure 7.

We also computed the positive and negative discovery rates of $\hat G(SC)$ and $\hat G(2\hat C_{\min})$ with respect to $V_i$. The results are presented in Figure 8.

Fig 7. Plot of the number of samples $n$ against the average of the risk ratios (12) and (7). 
The solid black line represents the risk ratio for the two-step procedure and the dashed grey 
line the risk ratio for the model selection procedure alone. Observe that the risk ratio of the 
two-step procedure remains closer to one than that of the model selection procedure alone. 

Fig 8. Plot of positive and negative discovery rates of $\hat G(SC)$ and $\hat G(2C_{\min})$ with re- 
spect to $V_i$ against the sample size $n$. The black solid/dashed lines represent the pos- 
itive/negative discovery rates of the two-step procedure. The grey solid/dashed lines 
represent the positive/negative discovery rates of the model selection procedure alone. 
Observe that the two-step procedure has almost perfect negative discovery rates with 
increasing positive discovery rates. 

5.8. Computationally efficient algorithm 

In this section we illustrate the performance of the strategy introduced in 
Section 4.2 on the Ising model on $A^G$, where $G = \{1, \dots, 200\}$, with pairwise 
potential $f_{ij}(c,d) = |J_{ij}|\mathbf{1}_{j \in V_i}\, cd$ for $i, j \in G$, $c, d \in A$, $V_i \subset G$, and $J_{ij}$ indepen- 
dently generated from a Gaussian distribution with $\mathbb{E}[J_{ij}] = 0$ and $\mathbb{E}[J_{ij}^2] = 4$. 
The pairs of sites $(i,j)$ with $j \in V_i$ are represented in Figure 9. 

Fig 9. Representation of the interacting sites in the Ising model described in Section 5.8. 
Both axes range over the sites 1 to 200. The positions of the dots indicate the pairs of 
sites $(i, j)$ for which $j \in V_i$. 



For this experiment $i = 1$ and $|V_i| = 16$. We simulated independent samples 
of the Ising model with increasing sample sizes $n = 100k$, $k = 1, \dots, 100$. For 
each sample size we have $N = 50$ independent replicas. In this example, it is not 
practical to compute all candidates in the collection $\mathcal{G}_{8,200}$, whereas the algorithm 
introduced in Section 4.2 is very efficient. We illustrate its performance in the 
case where the number of sites $j$ kept after Step 2 of the reduction step in 
Section 4.2 is 10. We denote the model chosen by this algorithm by $\hat G_{\mathrm{efficient}}$. 
We estimated the probability that the selected model $\hat G_{\mathrm{efficient}}$ recovers the 
largest, second, third, fourth and fifth largest interaction potentials. Formally, 
let $J_1 = \max\{|J_{ij}|\mathbf{1}_{j \in V_i} : j \in G\}$ and 
$J_k = \max\bigl(\{|J_{ij}|\mathbf{1}_{j \in V_i} : j \in G\} \setminus \{J_1, \dots, J_{k-1}\}\bigr)$ for $k = 2, \dots, 5$. We estimated 

$$\mathbb{P}\bigl(\hat G_{\mathrm{efficient}} \ni J_k\bigr) \qquad (13)$$

for $k = 1, \dots, 5$. The result of the simulation is presented in Figure 10. By 
Monte Carlo simulation using a sample size of 100 000 we concluded that the 
considered Ising model at site $i = 1$ does not satisfy the incoherence condition 
in [17]. 

Fig 10. Plot of the number of samples $n$ against the probability that $\hat G_{\mathrm{efficient}}$ includes 
the largest (solid black), and second (dashed black), third (solid gray), fourth (dashed 
gray), fifth (solid light gray) largest interaction potentials. Observe that the model se- 
lection procedure includes the sites with larger interaction potentials more often. 



6. Discussion 

We introduced a model selection procedure for interaction neighborhood esti- 
mation in partially observed random fields. We proved that the proposed rule 
satisfies an oracle inequality. The results hold under general assumptions which, 
for instance, are satisfied by a generalized form of the Ising model. 
Our model selection approach differs from other works [3, 7, 10, 11, 17], where 
only the INI problem is considered and more restrictive conditions are assumed. 
In particular, [3, 7, 17] consider the INI problem for finite random fields and as- 
sume that all the interacting sites are observed. This assumption is quite strong 
from a practical point of view, e.g. in neuroscience, where the experimenter never 
has access to the whole set of neurons. Our result holds for partially observed 
random fields without any restriction on the range of the interactions. 
Csiszár and Talata [10] consider a BIC-like consistent model selection procedure 
for homogeneous random fields on $\mathbb{Z}^d$. As they consider the INI problem using 
only one realization of the random field, their results are not immediately com- 
parable with ours, but it is interesting to note that they assume that the range 
of interaction is finite. 

In [11], the INI problem is considered for infinite range Ising models on $\mathbb{Z}^d$. 
The main restriction in this last work is that the interactions between the sites 
are assumed to be weak ("high temperature") and that a subset of the ob- 
served sites of size $O(\log(n))$, where $n$ is the sample size, must be fixed to apply 
the proposed procedure. Our procedure has no restriction on the strength of 
interaction and can be applied, for example, to low temperature Ising models, 
provided that the samples come from the same phase. Moreover, the model se- 
lection procedure can be applied in high dimensional situations where the subset 
of observed sites is of size $O(n^a)$, $a > 0$. 

We also introduced a two-step procedure in which the model selection rule gives 
us a small set of candidate sites and a cutting procedure removes from this 
set the irrelevant interactions. This two-step procedure can be understood as 
a combination of a model selection and a statistical test procedure, in the spirit of 
[22]. 

Our first simulation experiment shows that the concentration bound for the 
variance term of the risk in Theorem 3.1 is sharp. We propose a slope heuristic 
with a maximal jump criterion, using either the variance or the dimension as a 
measure of complexity, to choose a good constant in the model selection procedure. 
In our simulation experiment, we measured the performance of the slope heuristic 
for the ON and INI problems. We observed that the variance behaved better as 
a complexity measure than the dimension because 1) the risk ratio was always 
smaller for the variance than for the dimension, although both risk ratios 
remained bounded, 2) the estimated positive and negative discovery rates with 
respect to the oracle were always higher for the variance than for the dimension, 
and 3) the estimated positive and negative discovery rates with respect to the 
interaction neighborhood were also always higher for the variance than for the 
dimension. 

Although at this point the variance seems to be a better choice for the complex- 
ity measure, a more comprehensive study must be carried out to obtain a definitive 
conclusion, and we recommend considering both measures of complexity in prac- 
tice. We also addressed in the simulation experiments the relationship between 
the ON and INI problems and observed that, for sufficiently large sample sizes, 
both coincide. The two-step procedure introduced in this article was applied 
in an example where it clearly enhances the performance of the model selection 
procedure for both the INI and ON problems. Recently, multistep statistical 
procedures have been gaining attention [22], although only a few rigorous results exist. 
Our result for the two-step procedure is a contribution to this growing field. The 
main drawback of the proposed model selection procedure is its high computa- 
tional cost, which becomes prohibitive when a large number of sites are observed. 
We introduced a computationally efficient way to overcome this difficulty in the 
case of the Ising model. The new procedure drastically reduces the set of models 
to which the model selection procedure must be applied, while still keeping the 
main interacting sites and a good oracle property. In the simulation experiment 
we show that the proposed algorithm performs well even when the 
number of observed sites is as large as 200. It must be remarked that the Ising 
model considered for this experiment does not satisfy the incoherence condition 
[17] and therefore other computationally efficient algorithms such as $\ell_1$-penalization 
are not guaranteed to be consistent. Finally, we provide a set of MATLAB® 
routines that can be used to reproduce our experimental results and to carry 
out further simulation and applied studies. 

7. Proofs 



7.1. Proof of Theorem 3.1: 

For all $x$ such that $P(x(V/\{i\})) = 0$, we have $\hat P(x(V/\{i\})) = 0$, thus 
$\hat P_{i|V}(x) = P_{i|V}(x)$. Hence, we need only consider the configurations $x$ such that 
$P(x(V/\{i\})) > 0$. Let us first provide some inequalities about conditional probabilities. 

Lemma 7.1. Let $x \in X(G)$, let $V$ be a finite subset of $G$ and let $Q$, $R$ be two 
probability measures on $X(V)$ such that $R(x(V/\{i\})) > 0$. Then 

$$Q_{i|V}(x) - R_{i|V}(x) = \frac{Q(x(V)) - R(x(V)) + Q_{i|V}(x)\bigl(R(x(V/\{i\})) - Q(x(V/\{i\}))\bigr)}{R(x(V/\{i\}))},$$

$$|Q_{i|V}(x) - R_{i|V}(x)| \le 3 \sup_{x \in X(G),\ R(x(V/\{i\})) \neq 0} \frac{|Q(x(V)) - R(x(V))|}{R(x(V/\{i\}))}.$$

Remark: In particular, we deduce from this lemma that 

$$\left\|\hat P_{i|V} - P_{i|V}\right\|_\infty \le 3 \sup_{x \in X(G),\ P(x(V/\{i\})) \neq 0} \frac{|\hat P(x(V)) - P(x(V))|}{P(x(V/\{i\}))}.$$



The first identity follows from the fact that $R_{i|V}(x) = R(x(V))/R(x(V/\{i\}))$ 
and $Q(x(V)) = Q_{i|V}(x)Q(x(V/\{i\}))$. The second one is a consequence of the first 
one and of the fact that 

$$|R(x(V/\{i\})) - Q(x(V/\{i\}))| \le |R(x(V)) - Q(x(V))| + |R(x_i(V)) - Q(x_i(V))|.$$

The proof of (1) is concluded thanks to the following lemma. 

Lemma 7.2. Let $P$ be a probability measure on $X(G)$ and let $V$ be a finite subset 
of $G$. Let $X'(V) = \{x \in X(G),\ P(x(V/\{i\})) \neq 0\}$, $p_-^V = \inf_{x \in X'(G)} P(x(V/\{i\}))$. 
For all $\delta > 1$, with probability larger than $1 - \delta^{-1}$, we have 



sup 

xeX'(G) 



P(x(V)) - P(x(V)) 



P(x(V/{i})) 



npY_ 



2048 



\n{168/pY.) 



np_ 

Conclusion of the proof of (1). We deduce from Lemmas 7.1 and 7.2 that, 
with probability larger than $1 - \delta^{-1}$, 



Pi\V ~ Pi\V 



< 192^2* 



np_ 



a ,i • „.,.., , ln(16<5/p^) , . „ . ln(165A>_) 

As this result is trivial when -, > 1 , we can always assume that - < 



np_ 



np_ 



1, hence that ln(16 y ) < 
for ci = 6144+ 192^, 



ln(16<5/?>! 



- , thus, with probability larger than 1—6 1 , 



Pi\v — Pi\v 



< Cn 



npY_ 



Proof of Lemma 7.2: We apply Bousquet's version of Talagrand's inequality to 
the class of functions $\mathcal{F} = \{(P(x(V/\{i\})))^{-1}\mathbf{1}_{x(V)}\}$. This inequality is recalled 
in the Appendix. We have $v^2 \le (p_-^V)^{-1}$, $b \le (p_-^V)^{-1}$, hence, for all $\delta > 1$, with 
probability larger than $1 - \delta^{-1}$, 



sup 



P(x(V)) - P(x(V)) 



xex r (G) p(x(v/{i})) 



< 2J 



sup 



P(x(V)) - P{x{V)) 



xex r (G) p(x(v/{i})) 



'^ + 2^. (14) 



np_ 



np_ 



We apply Lemma 9.6 with $A_x = x(V)$, $x \in X(G)$, $a_x = [P(x(V/\{i\}))]^{-1}$. We 
have 

$$a^* = \sup_{x \in X'(G)} [P(x(V/\{i\}))]^{-1} \le \frac{1}{p_-^V}, \qquad p^* = \sup_{x \in X'(G)} [P(x(V/\{i\}))]^{-2} P(x(V)) \le \frac{1}{p_-^V}.$$

Hence, 



E 



P(x(V)) - P(x(V)) 



Jx? (G) P(x{V/{i})) 



< 



np" 



'Inl " 



1024 

np_ 



mi 4 



(15) 

Lemma 7.2 is then obtained with (14) and (15). 

Let us now turn to the proof of (2). Let $V$ be a finite subset of $S$. As (2) holds 
when $p_-^V = 0$, it remains to prove (2) when, for all $x$ in $X(V)$, $P(x(V)) > 0$. 
This is done by the following Proposition. 

Proposition 7.3. Let P be a probability measure on X(G), let V be a finite 
subset of G. Let X n = ix £ X{G), P(x(V/{i}) ^ }, pZ = inf xe # n P(x(V)). 
There exists an absolute constant $c_2 \le 400$ such that, for all $\delta > 1$, 
P I 3x e |Pi| V (a;) - P|v(aOI > c 2 



ln((5n) 



nP{x(V)) 



1 

< -. 

" S 



(16) 



in particular, 



Proof of Proposition 7.3. Let n > 2. 5 > 1. c 2 = 400 and let us first remark 
that we only have to prove (16) on the subset X' n C X n of all the x in X n such 
that P{x{V)) > c\ ln(£n)n _1 . Let .t in A^, then we also have P(x(V/{i}) ^ 0. 
From Lemma 7.1, we have 



|Pi|v(*)--Pi|v(aOI 



< 



P(x(F))-P( ; r(y)) 


+ 


P(x(V/{i}))-P(x(V/{i})) 


J 


P(x(V/{i})) 



From Lemma 7.1, we also have 

P(x(V)) - P(x(V)) + P(x(V/{i})) - P(x(V/{i})) 
\Pi\v(x) - Pi\ V (x)\ < 

We deduce that 



\P A v(x) ~ Pi\v(x)\ 



< 



P{x(V/{i})) V c\ ln^n- 1 
P(x(V)) - P(x(V)) + P(x(V/{i}))-P(x(V/{i})) 



P(x(V/{i})) V P(x(V/{i})) V c\ \n(Sn)n- 



Hence, using the elementary inequality a V b > V ab with a = P(x(V/{i})), 
b = P(x(V/{i})) V c\ ln(i5n)n _1 , we deduce that 



P(x(V)) - P(x(V)) + P(x(V/{i})) - P(x(V/{i})) 



\Pi\v(x) ~ Pi\v(x)\ < 

y/P(x{V/{i})) (P(x(V/{i})) V c\ \n{5n)n-^ ) 

We have obtain that, for all x in X' n , 



< 



P(x(V/{i}))\P i]v (x)-P i]v (x)\ 
P(x(V)) - P{x(V)) + P(x{V/{i}))-P(x(V/{i})) 



< 3 sup 



y/P(x(V/{i})) V c\ \n(8n)n- x 
P(x(V)) - P(x(V)) 



< 3 sup 



P(x(V)) - P{x{V)) 



xex^ y/P(x(V/{i})) V ln(tfn)n _1 *e*(G) y/P(x{V/{i})) V c\ ln(c5n)n 

We apply Bousquet's version of Talagrand's inequality to the class of functions 

T = { / = (P(x(V/{i})) V 4 HSn)^ 1 )- 112 l x(v)) x G A-(G) } 
We have 

« 2 = sup Var(/(Xx)) < 1, 6 = sup H/l^ < Jc^lnOfa))-^. 
/e^ /e.F v 

Hence, for all e > 0, with probability larger than 1 — S^ 1 , we have 



sup y/p(x(V/{i}))\P ilv (x) - P ilv (x)\ 



< 3(1 + e)E I sup |(P„ - P)f\ ] + 3,/^) + f ! , 3 A ^ 



f€T I V n \ e J C2 ^J\x\(8n)n 



< 3(l + e)E 



( sup|(P„-P)/| j + (3V2 + - + — 



In(5) 



We apply Lemma 9.6 to the sets A x = x(V) and the real numbers a x 
(P{x(V/{i}))V cll^njn^Y 1 ' 2 . We have 



a* < tJu{4 ln^n))" 1 p* = sup {P{x{V / {i})))- 1 P{x(V)) < 1. 



ire* (6?) 

Hence, 



In 

2048 — 

! - v /n(ln((5n)) 



E ( sup |(P„ - P)f \ ) < -^Wln ( Wn(c| ln(<5n))- ] 



(4Vn(clln(fa))-r) < / | 2Q48 x 

c 2 \/7i(ln((Sn))- 1 ~ V c 2 / V 71 



Thus, for all e > 0, with probability larger than 1 — 6 1 , we have 
sup V^S07W))l^|v(a:) - P|v(z)| 



We take $\epsilon = 0.001$ to conclude the proof. 

7.2. Proof of Theorem 3.2: 

It comes from Theorem 3.1 that, for all subsets $V$ in $\mathcal{G}_n$, we have, 

We use a union bound to get that, 



P W e Gn, 



P%\V — Pi\V 



< C 2 4 



lhi{N n Sn) 



>1-S~ 1 . 



Hereafter in the proof of Theorem 3.2, we denote by — y \n(N n Sn) ( np^_ ) 1 
and by 



n = \yv &g n , 



Pi\V ~ Pi\V 



< C2V, 







We have proved that $\mathbb{P}(\Omega) \ge 1 - \delta^{-1}$. Let $C > c_2$ and denote, for short, 
$\hat G = \hat G(C, \delta, \mathcal{G}_n)$. By definition of $\hat G$, for all $V \in \mathcal{G}_n$, 



P 



\G\ 



P. 



\G 



Cv° < IIP, 



|G| 



P 



\V 



+ Cpen(y). 



Hence, on il, for all V in Gr, 



i\G\ 



i\G 



(C - c 2 )v° < \\Pi\a\L ~ \\ P i\v\L + ( C + c 2)pen(n 



From Assumption HI, uPj 



\Cr\ 



i\G 



the triangular inequality, , HPslc^ ~ ll-^Vlloo - 
these inequalities in (17), we obtain that, for all V € Q. 



P i\G ~ P i\G 
Pi\G — Pi\v\ 



(17) 



and from 
Plugging 



Pi\G - P l \ G 



+ (C - ca)t£ < ||P|g - iVL + (C + c 2 )pen(F). (18) 
On O, for all V € <? n , we have then 

< 



P i\G ~ P *\G 



P i\G P i\G 



< max 



oo 

(—,-*-)( 

V «min C — C2 J V 



Pj| G - P\G 



< + 



Pi\G - Pi\ G 



P *\G - P|G 



+ (C-ca)i£ 



< max 



f'2 



L 'C-c 2 



|P|G-P|y|L + (C + c 2 )pen(^)) 



< X(c 2 , C, K min ) (||P| G - P i]v || + pen(F)) 



7.3. Proof of Corollary 3.3: 

It comes from [16] Proposition 2.5 p. 20 that 

$$N_{m,M} = |\mathcal{G}_{m,M}| = \sum_{k=0}^{m} \binom{M}{k} \le \left(\frac{eM}{m}\right)^m \le M^m, \quad \text{hence} \quad \ln(N_{m,M}) \le m\ln(M).$$
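
To make the size of the model collection concrete, the following Python sketch counts the subsets of at most $m$ sites among $M$ and compares this count with the bounds $(eM/m)^m$ and $m\ln(M)$ on its logarithm. The function names are hypothetical, and reading the collection as "all subsets of size at most $m$" is an assumption made for this illustration.

```python
from math import comb, e, log


def number_of_models(m, M):
    """Number of subsets of size at most m among M sites."""
    return sum(comb(M, k) for k in range(m + 1))


def check_cardinality_bounds(m, M):
    """Check N <= (eM/m)^m and ln(N) <= m ln(M) for the given sizes."""
    n_models = number_of_models(m, M)
    return n_models <= (e * M / m) ** m and log(n_models) <= m * log(M)
```

For instance, `number_of_models(8, 200)` gives the number of candidates of size at most 8 among 200 sites, which illustrates why an exhaustive search over such a collection is costly.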

Hence, from Theorem 3.2, with probability larger than $1 - \delta^{-1}$, we have 

(19) 



P i\G s (C)) P *\G 



< K inf 

veg m , 



\Pi\G ~ Pi\v\ 



/ln(nM m <5) 



npi 



For all |V| > log 2 (n), there is at least one configuration in X{V) that is not 
observed, hence p^_ = 1/n. Therefore, for all m > log 2 (n), 



inf 

V601og 3 (n),M 



\Pi\G-Pi\v\ 



l T M (S) 
npY_ 



inf 

veg m . 



\*i\G-Pi\v\\ 00 + 



< Tm(S) 
np^_ 



Taking m = M, (19) yields the corollary. 



7.4. Proof of Theorem 3.4: 

Let $\Omega$ be the event defined in the proof of Theorem 3.2 for the collection $\mathcal{G}_n$ and 
let $\hat G = \hat G_\delta(C)$. We have $\mathbb{P}(\Omega^c) \le \delta^{-1}$ and, on $\Omega$, from Corollary 3.3, 



P i\G ~ P i\G 



By definition of 4", denoting by l n = dn 1 Tm(S), we have 

v ri u { \\Pi\G PiwL + vX(S) } = inf{* (w) + vl n }. 

Let v* be the smallest solution of the equation vl n = ty(v). As is non- 
increasing and v 1 — ^ vl n is non decreasing, we have 

< inf +vl n } < 2*(u*). 

v>0 



Thus, on ft, we have 



P i\G ~ P i\G 



< 2K^(v*). 



Let f2 2 be the event defined in H2. Let r < 1 and tu in ft* = ft D f2 2 such that 
Pl.(oj) > (ru*)~ 2 . From assumption H2 applied to u = to*, A' = r _1 , 

2A*(w*)> P 4|(5 H-P ? | G >*(to*)>^V- q *( V *). 

1 OO 

Hence r > (2C^,K)~ 1 / a . Thus, on O*, we have 



^IG _ P i|G 



< 2A"Lv* < Cv 



-1/as 



(2A) 



By the triangular inequality, we have 



sup \{Pi\ v (x) - P t \ V (xj)) - (Pi\a(x) - Pi\a(xj))\ 
xex(G) 



< 2 



Pi\V — Pi\V 



+ \\Pi\v-P i \a\\ 00 ) 






Hence, on VL* , 



w£-(P)-u;%-(P) 



< 2 c 2 v 



P t\G ~ P i\G 



< 2 (c 2 + G~ 1/q * {2K) x - x / a * \ v[ 



Let Coo = 2 (c 2 + C $ 1/Q4, (2K ) 1 -V«* ^ . it comes from this last inequality that, 

on n*, 

{jeV M , wg-(P) >(c + Coo )^} 

c Gf (c) c{iey M , > (c - cx,)^ } . 



7.5. Proof of Theorem 4.5: 

In all the proof, for all subsets $V$, $V'$ of $G$ such that $V \cap V' = \emptyset$ and for all $(x, y)$ in 
$X(V) \times X(V')$, let $x(V) \oplus y(V')$ be the configuration on $X(V \cup V')$ such that, 
for all $j$ in $V$, $(x(V) \oplus y(V'))(j) = x(j)$ and, for all $j$ in $V'$, $(x(V) \oplus y(V'))(j) = y(j)$. 
Let $V$ be a finite subset of $G$ and let $x$ be a configuration in $X(G)$. We have 

$$P_{i|G}(x) - P_{i|V}(x) = \int \bigl(P_{i|G}(x) - P_{i|G}(x(V) \oplus y(G/V))\bigr)\, dP(y(G/V) \mid x(V/\{i\})). \qquad (20)$$
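
Identity (20) is an instance of the tower property of conditional expectations. As a sanity check, the Python sketch below verifies it numerically on a small finite field with spins in $\{-1, 1\}$. Everything here is an assumption made for illustration: the site labels $0, \dots, 3$, the randomly drawn strictly positive measure (not an Ising model), and the helper names `glue`, `marginal` and `conditional`.

```python
import itertools
import random

SPINS = (-1, 1)


def glue(x_part, y_part):
    """x(V) (+) y(V'): merge two configurations living on disjoint site sets."""
    if set(x_part) & set(y_part):
        raise ValueError("the two site sets must be disjoint")
    return {**x_part, **y_part}


def marginal(P, partial):
    """Probability that a configuration agrees with `partial` on its sites."""
    return sum(p for cfg, p in P.items()
               if all(cfg[s] == v for s, v in partial.items()))


def conditional(P, i, partial):
    """Probability of x(i) given the rest of `partial`, a configuration on a set containing i."""
    rest = {s: v for s, v in partial.items() if s != i}
    return marginal(P, partial) / marginal(P, rest)


def gap_in_identity_20(P, sites, V, i, x):
    """Left-hand side minus right-hand side of (20) at configuration x."""
    x_G = {s: x[s] for s in sites}
    x_V = {s: x[s] for s in V}
    x_V_minus_i = {s: x[s] for s in V if s != i}
    outside = [s for s in sites if s not in V]
    lhs = conditional(P, i, x_G) - conditional(P, i, x_V)
    rhs = 0.0
    for y in itertools.product(SPINS, repeat=len(outside)):
        y_out = dict(zip(outside, y))
        # Weight P(y(G/V) | x(V/{i})) of the outside configuration.
        weight = marginal(P, glue(x_V_minus_i, y_out)) / marginal(P, x_V_minus_i)
        rhs += (conditional(P, i, x_G)
                - conditional(P, i, glue(x_V, y_out))) * weight
    return lhs - rhs


def random_positive_measure(n_sites, seed=0):
    """A strictly positive probability measure on {-1, 1}^n_sites."""
    rng = random.Random(seed)
    configs = list(itertools.product(SPINS, repeat=n_sites))
    weights = [rng.random() + 0.1 for _ in configs]
    total = sum(weights)
    return {cfg: w / total for cfg, w in zip(configs, weights)}
```

With `P = random_positive_measure(4)`, the value `gap_in_identity_20(P, range(4), [0, 1], 0, x)` should vanish up to rounding for every configuration `x`.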

From the definition of a Gibbs measure, we have 
Pi\ G {x)-P^ G {x{V)®y{G/V)) 

e -J2 jEG 9i,j(x(i),x(j)) ^Z, jiv (gi,i(x(i),x(j))-g itj (x(i),y(j))) _ j\ 



^1 + e -T. je G^.j(x(i),x(v)(By(G/v)uy)^ + e -E i60 fli,j(»W.*(i))^ 



(21) 



Hence, 



\Pi\G{x)-Pi\G{x{V)®y{G/V))\ 
< 



(1 + e- 2 '') 2 



e z; 3 'gv(s».3( x (*')' x (j)) _ f*.j( a: ( i )>i'0'))) _ i 



Let us now give the following lemma, whose proof is immediate from the con- 
vexity of $x \mapsto e^x$. 

Lemma 7.4. For all real numbers $r > 0$, for all $x$ in $[-4r, 4r]$, we have 

$$\frac{1 - e^{-4r}}{4r}\,|x| \le |e^x - 1| \le \frac{e^{4r} - 1}{4r}\,|x|.$$
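
The two-sided bound $\frac{1 - e^{-4r}}{4r}|x| \le |e^x - 1| \le \frac{e^{4r} - 1}{4r}|x|$ on $[-4r, 4r]$ can be checked numerically on a grid. The Python sketch below does so; the function names, the grid size and the small relative slack used to absorb rounding errors at the endpoints are choices made for this illustration.

```python
import math


def lemma_7_4_bounds(r, x):
    """Return (lower, |e^x - 1|, upper) for the two-sided bound on [-4r, 4r]."""
    lower = -math.expm1(-4 * r) / (4 * r) * abs(x)  # (1 - e^{-4r}) / (4r) * |x|
    upper = math.expm1(4 * r) / (4 * r) * abs(x)    # (e^{4r} - 1) / (4r) * |x|
    return lower, abs(math.expm1(x)), upper


def holds_on_grid(r, n_points=2001, slack=1e-9):
    """Check the bound on an evenly spaced grid of [-4r, 4r]."""
    for k in range(n_points):
        x = -4 * r + 8 * r * k / (n_points - 1)
        lower, middle, upper = lemma_7_4_bounds(r, x)
        if lower > middle * (1 + slack) + slack:
            return False
        if middle > upper * (1 + slack) + slack:
            return False
    return True
```

Both inequalities are attained at the endpoints ($x = -4r$ for the lower bound and $x = 4r$ for the upper bound), which is why a small slack is needed in floating point arithmetic.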

We deduce from Lemma 7.4 that 

\P ilG (x)-P ilG (x(V)®y(G/V))\ 
(e ir - l)e 2r 



< 



4r(l + e- 2r ) 2 
It is clear that, for all x in X(G) 



^2(9i,j(x(i),x(j)) - 9i,j(x(i),y(j))) 



52(gu(x(i),x(j))- gid (x(i),yV))) dP(y(G/V)\x(V/{i})) 

<Y,«MP{y(.3)*x{j)Wvi{m 

The upper bound comes then from the inequality $P(y(j) \neq x(j) \mid x(V/\{i\})) \le 1$. 

For the lower bound, let, for all $a$ in $A$, $x^a_{\max}$ be the configuration such that, 
for all $j$ in $G$, $g_{i,j}(a, x^a_{\max}(j)) = \|g_{i,j}\|$ and let $x^a_{\min}$ be the configuration such 
that, for all $j$ in $G$, $g_{i,j}(a, x^a_{\min}(j)) = \inf_{b \in A} g_{i,j}(a, b)$. From Lemma 7.4, we have 



a E^ v ||9, x , ( / ) ||-9i,j( a; ( i )^(i)) _ ^ _ 



X'HV \\9i!'j ) \\-9i,j(x(i),y(j)) _ 



> X -^r— Yl hi? -9i,iW),vV))- (22) 



4r 



Finally, we have 



x(i) 



giij (x(i) >y (j))dP(y(G/V)\x(V/{i})) 



(/) 



Using successively inequalities (20), (21), (22) and (23) with x — x ma k, we obtain 



1 + e 2r ' 

x(i) 



(23) 



sup Pi\ G {x) - Pi\ v {x) 
xex(G) 



> 



(Pi\ G (x x J£) ~ P^G^HiV) © y(G/V)))dP(y(G/V)\x(V/{i})) 



1 + e- 



-fl*,i(a:(*)iV(j)) 



1 _|_ e ^-jev|| 9 i,j ' 



dP(y(G/V)\x(V/{i})) 



> 



{l + e 2r f 



.E^v 9--' 1 -ff», 3 (a:(«).J/(j)) _ 



1 dP(y(G/F)|x(y/{i})) 

Hence, 

sup Pi\ G (x) - Pi\ V (x) 
xex(G) 



> (1 - er Ar )e- 2r 



Aril + e 2r ) 2 
(1 - e~ 4r )e~ 2r ^ 

4r .(l + e 2r)3 Z, W «^)- 



E Ikfi " 9iA*®*vV)) dP(y(G/V)\x(V/{i})) 



3<£V 



Let us now check that $P$ satisfies assumption H1. Let $x$ in $X(V)$. Using suc- 
cessively inequalities (20), (21), (22) and (23) with $x = x(V) \oplus x^{x(i)}_{\max}(G/V)$, we 
obtain, as in the previous proof, 

/ -i — 4?*\ — "2r 

P MG (x(V)®x%£(G/V))-P ilv (x) > \~ e i ,Z )3 E^(/)- 



I-PiIgIIoo) w e obtain 



Taking x such that Pj|y(a;) = ||Pi|y || and using that Phq(x(V)®x 
tain 

(1 — e~ 4r )e~ 2r 
P »|g|L - II^VlL - 4 r (i + e 2r)3 E w ^' (/)• 



This yields the theorem thanks to inequality (6). 
8. Proof of Theorem 4.6: 



< P( Xl (i,j)) - P(a?i(i, + |P(a>i(j)) - P(Mj))\ + \P(xiH)) ~ P(»i(<)) 

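We apply Hoeffding's inequality (see for example [16] Proposition 2.7) to the func- 
tions $t = \mathbf{1}_{x_1(i,j)}, \mathbf{1}_{x_1(i)}, \mathbf{1}_{x_1(j)}$: for all $x > 0$, we have 

$$\mathbb{P}\left(|(P_n - P)t| > \sqrt{\frac{x}{2n}}\right) \le 2e^{-x}.$$

Hence, a union bound gives that, on a set $\Omega(\delta)$ satisfying $\mathbb{P}(\Omega(\delta)^c) \le \delta^{-1}$, for 
all $j$ in $V_M$, 

$$|\hat P(x_1(i,j)) - P(x_1(i,j))| + |\hat P(x_1(j)) - P(x_1(j))| + |\hat P(x_1(i)) - P(x_1(i))| \le 3\sqrt{\frac{\ln(6M\delta)}{2n}}.$$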
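
The choice of the level in the union bound can be made explicit: Hoeffding's inequality is applied to at most $3M$ indicator functions, each failing with probability at most $2e^{-x}$, so that $x = \ln(6M\delta)$ gives a total failure probability of at most $\delta^{-1}$. The Python sketch below records this bookkeeping; the function names are hypothetical and the count $3M$ is the crude upper bound assumed here.

```python
import math


def hoeffding_radius(n, x):
    """Deviation sqrt(x / (2n)) exceeded with probability at most 2 exp(-x)."""
    return math.sqrt(x / (2 * n))


def union_bound_radius(n, M, delta):
    """Bound 3 sqrt(ln(6 M delta) / (2n)) on the sum of the three deviations."""
    x = math.log(6 * M * delta)
    return 3 * hoeffding_radius(n, x)


def total_failure_probability(M, delta):
    """Union bound over 3M events, each of probability at most 2 exp(-x)."""
    x = math.log(6 * M * delta)
    return 3 * M * 2 * math.exp(-x)
```

For example, `union_bound_radius(1000, 200, 10.0)` gives the deviation that holds simultaneously for all 200 sites with probability at least 0.9 under these assumptions.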

Moreover, we have 

Pi^xx) - P{x x {i)) = {P lb ( Xl ) - P m { Xl )) = (24) 
Pi\ G (xi(i,j) ® y(G/{i,j})) - Pi\ G (xi(%) © y(G/{i}))dP(y(G/{i})\ Xl (i)). 

From the definition of a Gibbs measure, we have 
Pi\ G (xi(i,j) ® y(G/{i,j})) - Pi\ G {xi{i) ® y(G/{i})) (25) 

e -J2 je c 9i,j(xi(i),y(j)) ^ e (9,,j(a;i(i),Ki(i))- 9l ,j(x 1 (i),y(i))) _ -A 

We can assume, without loss of generality, that $g_{i,j}(x_1(i), x_1(j)) = \|g_{i,j}\|$ 
and therefore that 



e 9i,j(zi(i),Xi(j))-gi,j(xi(i),V<j)) _ - 

It comes then from Lemma 7.4 that 

1 - e- 4r 



,Si,j(»i(i) I *iO'))-9«,j(»i(j).fO')) _ 1 



\di,j(%l{i),%l(j)) - g i j(x 1 (i),y(j))\ 

< e 9i,j{xi(i),x 1 {j))-g iJ (x 1 (i),y{j)) _ ^ 

e 4r - 1 

< 4r \9i,j(xi(i),xi(j)) - g id (xi(i),y(j))\ 



Hence 

1 - er Ar 



ir 



4r i 

< e si, 3 ^iW^i(j))-si,i(^iW.J/(j)) _ i < £__ (26) 



Using successively (24), (25), (26), we deduce that 
e" 2r fl - e~ ir ) e~ 2r (l-e~ 4r ) 



4r ( 1 + e 2r) 3 i wwi - 4r(l + e 2r ) 2 

4r(l + e- 2r ) 



< \P( Xl (i,j)) P(x^))P( Xl (j))\ < ■ > \ UiJ (f)\ . 



We conclude that, on Cl(6), 



e -2r {1 _ e - ir) \n(6MS) 

4r(l + e^)3 \<**M)\ 3^ 

e 2r (e 4r -l) /ln(6M(S) 



< 



P( a : 1 (i,i))-P(a :i (i))P(^(i)) < 4r(1 V +e „ 2r) ; 2 Kj(/)l+3y-^ 

All the sites j such that 



4r(l + e 2r ) 3 / /ln(6M«5) ^ 



e -ar(l - e-^) 
belong to V(»y). All the sites such that 



2n / 



(/)!< 



4r(l + e- 2r ) 2 

e 2r( e 4r _ f) 



ln(6M<5) \ 
2n ] 



(27) 



(28) 



do not belong to V(rj). 

We use then Theorem 3.2 with the collection G = {V C V(r] ms ) }. Its cardinality 
is bounded by n K . There exist a constant K and an event f22(<5), with probability 
1 — S, such that, on D,2(S), 



Pt\G - Pi\G 



< K inf I \\Pav-P, 



\v - n\G\ 



np^ 



We use Theorem 4.5 to say that 



\\p i \v-Pi\G\\<c r J2\u i , j (f)\ = c r E k,-(/)|+ E Ki(/)l 

We deduce from (27) that, on 0.(5), 

E Ki(/)i< E 



We choose $\Omega_*(\delta) = \Omega(\delta) \cap \Omega_2(\delta)$ to conclude the proof. 



9. Appendix 

In this Appendix, we recall the bound given by Bousquet [6] for the deviation 
of the supremum of the empirical process. 

Theorem 9.1. Let $X_1, \dots, X_n$ be i.i.d. random variables valued in a measurable 
space $(A, \mathcal{X})$. Let $\mathcal{F}$ be a class of real valued functions, defined on $A$ and bounded 
by $b$. Let $v^2 = \sup_{f \in \mathcal{F}} P[(f - Pf)^2]$ and $Z = \sup_{f \in \mathcal{F}} (P_n - P)f$. Then, for all 
$x > 0$, 

$$\mathbb{P}\left(Z > \mathbb{E}(Z) + \sqrt{\frac{2}{n}\bigl(v^2 + 2b\,\mathbb{E}(Z)\bigr)x} + \frac{bx}{3n}\right) \le e^{-x}. \qquad (29)$$
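
To see how the three terms of Bousquet's bound interact, the Python sketch below evaluates the deviation level $\mathbb{E}(Z) + \sqrt{2(v^2 + 2b\,\mathbb{E}(Z))x/n} + bx/(3n)$ together with a standard simplification obtained from $\sqrt{u + w} \le \sqrt{u} + \sqrt{w}$ and $2\sqrt{uw} \le u + w$. The function names are hypothetical, and the constants of the simplified form are those produced by this elementary argument; they are not claimed to match the constants used in the proofs above.

```python
import math


def bousquet_bound(mean_sup, v2, b, n, x):
    """Level exceeded by Z with probability at most exp(-x)."""
    return (mean_sup
            + math.sqrt(2.0 * (v2 + 2.0 * b * mean_sup) * x / n)
            + b * x / (3.0 * n))


def simplified_bound(mean_sup, v2, b, n, x):
    """Upper bound on bousquet_bound: 2 E(Z) + sqrt(2 v^2 x / n) + 4 b x / (3 n)."""
    return (2.0 * mean_sup
            + math.sqrt(2.0 * v2 * x / n)
            + 4.0 * b * x / (3.0 * n))
```

The simplified form separates the expectation of the supremum from the variance and range terms, which is the shape in which such bounds are typically combined with a bound on $\mathbb{E}(Z)$.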

Let us recall some well known tools of empirical process theory. 

Definition 9.2. The covering number $N(\epsilon, T, d)$ is the minimal number of balls 
of radius $\epsilon$ with centers in $T$ needed to cover $T$. The entropy is the logarithm of 
the covering number, $H(\epsilon, T, d) = \ln(N(\epsilon, T, d))$. 

Definition 9.3. An $\epsilon$-separated subset of $T$ is a subset of elements of $T$ 
whose pairwise distance is strictly larger than $\epsilon$. The packing number $M(\epsilon, T, d)$ 
is the maximum size of an $\epsilon$-separated subset of $T$. 

Those quantities are related by the following famous lemma. 

Lemma 9.4. (Kolmogorov and Tikhomirov [13]) Let $(T,d)$ be a metric space 
and let $\epsilon > 0$. Then 

$$N(\epsilon,T,d) \le M(\epsilon,T,d) \le N(\epsilon/2,T,d).$$
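
On a small finite metric space, covering and packing numbers can be computed by exhaustive search, which makes the comparison between them easy to check. The Python sketch below is such a brute-force illustration; the function names are hypothetical, balls are taken closed, and the search is only practical for a handful of points.

```python
from itertools import combinations


def covering_number(points, eps, dist):
    """Smallest number of closed eps-balls centered at points that cover all points."""
    for size in range(1, len(points) + 1):
        for centers in combinations(points, size):
            if all(any(dist(p, c) <= eps for c in centers) for p in points):
                return size
    return 0  # only reached for an empty point set


def packing_number(points, eps, dist):
    """Largest size of a subset whose pairwise distances all exceed eps."""
    for size in range(len(points), 0, -1):
        for subset in combinations(points, size):
            if all(dist(p, q) > eps for p, q in combinations(subset, 2)):
                return size
    return 0  # only reached for an empty point set
```

For the points 0, 3, 5, 11, 12, 20 on the real line with the usual distance and $\epsilon = 4$, the search returns a covering number of 3, a packing number of 4, and a covering number of 4 at radius $\epsilon/2 = 2$.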

The following result can be derived from classical chaining arguments (see 
for example [6]). 

Lemma 9.5. Let $\mathcal{F}$ be a class of functions, let $d_{2,P_n}(t,t') = \sqrt{P_n[(t - t')^2]}$ and 
$D_n = \sqrt{\sup_{t \in \mathcal{F}} P_n(t^2)}$, then 

E (sup|(P„ -P)t\j < ^E (J""'* H 1 / 2 (u,^,d 2 , P Jdv\ . 

The next result was used to obtain our concentration inequalities. 

Lemma 9.6. Let $(A_i)_{i \in I}$ be a collection of sets such that, for all $i \neq j \in I$, 
$A_i \cap A_j = \emptyset$ and let $(a_i)_{i \in I}$ be a collection of positive real numbers. Let $Z_I = 
\sup_{t \in \mathcal{F}_I} |(P_n - P)t|$, where $\mathcal{F}_I = \{t_i = a_i \mathbf{1}_{A_i}\}$ and $P_n$ is the empirical measure. 
Let $a^* = \sup_{i \in I} a_i$, $p^* = \sup_{i \in I} a_i^2 P(A_i)$. We have 

In order to apply Lemma 9.5 to $\mathcal{F} = \mathcal{F}_I$, we compute the entropy of $\mathcal{F}_I$. For all 
$i \neq j$, since $A_i \cap A_j = \emptyset$, 

$$(t_i - t_j)^2 = (a_i \mathbf{1}_{A_i} - a_j \mathbf{1}_{A_j})^2 = a_i^2 \mathbf{1}_{A_i} + a_j^2 \mathbf{1}_{A_j}.$$

Hence $d_{2,P_n}(t_i, t_j) = \sqrt{a_i^2 P_n(A_i) + a_j^2 P_n(A_j)}$. 
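
The distance between two weighted indicators of disjoint sets, and the bound $N \le 1 + 2(a^*)^2\epsilon^{-2}$ on the size of an $\epsilon$-separated family that this proof derives from it, can be illustrated numerically. In the Python sketch below the sample is encoded by a list of labels (the index of the set $A_i$ containing each observation); this encoding, the function names and the greedy construction are assumptions made for illustration.

```python
import math


def distance(i, j, a, p_n):
    """Closed form of d_{2,P_n}(t_i, t_j) for t_i = a_i 1_{A_i}, disjoint A_i."""
    if i == j:
        return 0.0
    return math.sqrt(a[i] ** 2 * p_n[i] + a[j] ** 2 * p_n[j])


def empirical_distance(labels, i, j, a):
    """Same distance computed directly as sqrt(P_n[(t_i - t_j)^2])."""
    total = sum((a[i] * (l == i) - a[j] * (l == j)) ** 2 for l in labels)
    return math.sqrt(total / len(labels))


def greedy_separated_subset(a, p_n, eps):
    """Indices of a maximal eps-separated family, built greedily."""
    chosen = []
    for i in range(len(a)):
        if all(distance(i, k, a, p_n) > eps for k in chosen):
            chosen.append(i)
    return chosen
```

Any $\epsilon$-separated family, in particular the greedy one, has at most $1 + 2(a^*)^2\epsilon^{-2}$ elements, where $a^*$ is the largest weight.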

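Consider an $\epsilon$-separated set $T_\epsilon = \{t_{i_1}, \dots, t_{i_N}\}$ in $(\mathcal{F}_I, d_{2,P_n})$ (see also the def- 
inition in the appendix), it comes from the previous computation that, for all 
$k \neq k'$, 

$$a_{i_k}^2 P_n(A_{i_k}) + a_{i_{k'}}^2 P_n(A_{i_{k'}}) > \epsilon^2.$$

Hence, there are at least $N - 1$ indices $k \in \{1, \dots, N\}$ such that $a_{i_k}^2 P_n(A_{i_k}) > \epsilon^2/2$. 
It follows that 

$$1 \ge \sum_{i \in I} P_n(A_i) \ge \sum_{k=1}^{N} P_n(A_{i_k}) \ge \frac{(N - 1)\epsilon^2}{2(a^*)^2}.$$

Hence $N \le 1 + 2(a^*)^2\epsilon^{-2}$, thus $H(\epsilon, \mathcal{F}_I, d_{2,P_n}) \le \ln\bigl(1 + 2(a^*)^2\epsilon^{-2}\bigr)$. We deduce 
from this inequality and Lemma 9.5 that 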

Efsup \(P n -P)t\) < 

32 
< -^E 
In \ j 



Pi/ 2 



v /ln(l + 2(a*) 2 e- 2 )rfe 



V / ln(2a*e- 1 )de , 



(31) 



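where $p_n^* = \sup_{i \in I} a_i^2 P_n(A_i)$. Now, let us recall the following elementary lemma. 

Lemma 9.7. For all positive real numbers $K$, $A$ such that $K/A \ge e$, we have 

$$\int_0^A \sqrt{\ln(K/x)}\,dx \le 2A\sqrt{\ln(K/A)}.$$

Actually, 

$$\int_0^A \sqrt{\ln(K/x)}\,dx = A\sqrt{\ln(K/A)} + \frac{K}{2}\int_{K/A}^{\infty} \frac{du}{u^2\sqrt{\ln u}}.$$

Since $K/A \ge e$, $\frac{1}{\sqrt{\ln u}} \le \sqrt{\ln u}$ on $[K/A, \infty[$. The result follows. 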
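
The bound $\int_0^A \sqrt{\ln(K/x)}\,dx \le 2A\sqrt{\ln(K/A)}$ for $K/A \ge e$ can be checked numerically. The Python sketch below approximates the integral with a midpoint rule, which never evaluates the integrand at the singular point $x = 0$; the function names and the number of steps are arbitrary choices made for this illustration.

```python
import math


def integral_sqrt_log(K, A, n_steps=100_000):
    """Midpoint-rule approximation of the integral of sqrt(ln(K/x)) over (0, A)."""
    h = A / n_steps
    return h * sum(math.sqrt(math.log(K / ((k + 0.5) * h)))
                   for k in range(n_steps))


def lemma_9_7_bound(K, A):
    """Right-hand side 2 A sqrt(ln(K/A)) of the bound."""
    return 2.0 * A * math.sqrt(math.log(K / A))
```

Since the integrand is decreasing in $x$, the integral is also at least $A\sqrt{\ln(K/A)}$, so the upper bound is off by at most a factor 2.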

By definition, $p_n^* \le (a^*)^2$, hence $2a^*/(\sqrt{p_n^*}/2) \ge 4 > e$; we deduce from Lemma 
9.7 that 



E( sup \(P„-P)t\) < ^=E 
Let us now give another simple lemma. 

Lemma 9.8. The function $f : x \mapsto x\sqrt{\ln(K/x)}$, defined on $(0, K)$, is positive, 
non decreasing on $(0, K/e^{1/2})$ and strictly concave. 

The proof of the lemma is straightforward from the computations 

$$f'(x) = \sqrt{\ln(K/x)} - \frac{1}{2\sqrt{\ln(K/x)}}, \qquad f''(x) = -\frac{1}{2x\sqrt{\ln(K/x)}} - \frac{1}{4x\bigl(\sqrt{\ln(K/x)}\bigr)^3}.$$



Applying Lemmas 9.8, 9.7, and Jensen's inequality to the right hand side of (31) 
we have that 



E 



sup \(p n -p)t\) < ^e(v^) \ 

ten J V n v ' \ 



In 



4a* 



'Pn) 



Now it comes from Jensen inequality that 



3 



< V®M <Vp T +\la*E[ B up|(P B -P)t 

It is clear from its definition that $p^* \le (a^*)^2$. Moreover, as $P_n$ and $P$ are 
probability measures, we have, for all $t$ in $\mathcal{F}_I$, $|(P_n - P)t| \le 2a^*$. Hence, 

$\sqrt{a^*\,\mathbb{E}\bigl(\sup_{t \in \mathcal{F}_I} |(P_n - P)t|\bigr)} \le \sqrt{2}\,a^*$. We deduce from these inequalities that 

$$\sqrt{p^*} + \sqrt{a^*\,\mathbb{E}\Bigl(\sup_{t \in \mathcal{F}_I} |(P_n - P)t|\Bigr)} \le (1 + \sqrt{2})\,a^* \le (4a^*)/e^{1/2}.$$
Hence, it comes from Lemma 9.8 that, if $E = \mathbb{E}\bigl(\sup_{t \in \mathcal{F}_I} |(P_n - P)t|\bigr)$ 




It is then straightforward that (30) holds. 